Overview

Real-world datasets often include values for many variables. Our brains cannot efficiently process high-dimensional datasets to derive useful, actionable insights. In this post I will look at ways to reduce that dimensionality.

The three fundamental dimensionality reduction techniques that will be covered are

Principal component analysis (PCA)

As a data scientist, you’ll frequently have to deal with messy, high-dimensional datasets. In this section, I will cover Principal Component Analysis (PCA), which reduces the dimensionality of a dataset so that it is easier to extract actionable insights. The motivation for reducing dimensionality through techniques such as PCA is to explain as much of the variation in the data as possible while discarding highly correlated, redundant variables.

Curse of Dimensionality

  • Dimensions: the number of columns in the dataset that represent features of observations

  • Dimensionality: the number of features (columns) characterizing the dataset

  • Observed vs True Dimensionality: observed features obscure the true or intrinsic dimensionality of the data.

Deal with the Curse of Dimensionality by removing redundancy.

Note: As the dimensionality of the data grows, the feature space grows with it, so a fixed number of observations covers that space more and more sparsely.
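
To get a feel for how quickly the feature space outgrows the data, here is a rough sketch. It assumes each feature is cut into 10 bins (an arbitrary resolution picked only for illustration) and compares the number of resulting grid cells with 387 observations, the sample size of the cars data used later in this post.

```r
# Illustrative assumptions: 10 bins per feature, 387 observations.
n_obs <- 387
bins_per_feature <- 10
d <- c(1, 2, 3, 5, 10, 18)          # number of features considered

# Number of cells in a regular grid over d features
cells <- bins_per_feature^d

# Upper bound on the share of cells that can contain at least one observation
max_share <- pmin(1, n_obs / cells)

print(data.frame(dimensions = d, cells = cells, max_share_occupied = max_share))
```

Even under these generous assumptions, most of the grid is necessarily empty once there are more than a handful of features, which is why removing redundant features helps.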

Data

The data used for this analysis is the 2004 New Car and Truck Data. The data can be found at the JSE Data Archive.

This data set describes cars from a number of brands, all from 2004. The first step is to explore the dataset and attempt to draw useful conclusions from the correlation matrix. Correlation reveals which features resemble each other, and that in turn helps infer how cars are related to each other based on their features’ values. The data consist of 387 observations and 21 variables.

PCA

  1. Pre-processing steps
  • Data centering and standardisation
  2. Change of coordinate system
  • Rotation and projection
  3. Explained variance
  • Scree plot and the explained variance
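
The three steps above can be sketched by hand in base R. This is a minimal illustration rather than the code behind this post: it uses the built-in mtcars data as a stand-in for the cars data, and all object names are chosen for illustration.

```r
# Stand-in data (assumption): five numeric columns of the built-in mtcars data.
X <- as.matrix(mtcars[, c("mpg", "disp", "hp", "wt", "qsec")])

# 1. Pre-processing: centre each column and scale it to unit variance.
Z <- scale(X, center = TRUE, scale = TRUE)

# 2. Change of coordinate system: the eigenvectors of the correlation matrix
#    define the rotation; projecting Z onto them gives the PC scores.
eig <- eigen(cor(X))
rotation <- eig$vectors
scores <- Z %*% rotation

# 3. Explained variance: each eigenvalue is the variance along one PC.
explained <- eig$values / sum(eig$values)
print(round(cumsum(explained), 3))

# Cross-check against prcomp(); the signs of individual PCs may differ.
pca <- prcomp(X, center = TRUE, scale. = TRUE)
print(all.equal(eig$values, pca$sdev^2))
```

Plotting `eig$values` against the component number gives the scree plot mentioned in step 3.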

  • Explore cars with summary()

summary(cars)
##  Vehicle.Name         Sports.Car          SUV             Wagon        
##  Length:387         Min.   :0.0000   Min.   :0.0000   Min.   :0.00000  
##  Class :character   1st Qu.:0.0000   1st Qu.:0.0000   1st Qu.:0.00000  
##  Mode  :character   Median :0.0000   Median :0.0000   Median :0.00000  
##                     Mean   :0.1163   Mean   :0.1525   Mean   :0.07494  
##                     3rd Qu.:0.0000   3rd Qu.:0.0000   3rd Qu.:0.00000  
##                     Max.   :1.0000   Max.   :1.0000   Max.   :1.00000  
##     Minivan            Pickup       AWD              RWD        
##  Min.   :0.00000   Min.   :0   Min.   :0.0000   Min.   :0.0000  
##  1st Qu.:0.00000   1st Qu.:0   1st Qu.:0.0000   1st Qu.:0.0000  
##  Median :0.00000   Median :0   Median :0.0000   Median :0.0000  
##  Mean   :0.05168   Mean   :0   Mean   :0.2016   Mean   :0.2429  
##  3rd Qu.:0.00000   3rd Qu.:0   3rd Qu.:0.0000   3rd Qu.:0.0000  
##  Max.   :1.00000   Max.   :0   Max.   :1.0000   Max.   :1.0000  
##   Retail.Price     Dealer.Cost      Engine.Size         Cyl        
##  Min.   : 10280   Min.   :  9875   Min.   :1.400   Min.   : 3.000  
##  1st Qu.: 20997   1st Qu.: 19575   1st Qu.:2.300   1st Qu.: 4.000  
##  Median : 28495   Median : 26155   Median :3.000   Median : 6.000  
##  Mean   : 33231   Mean   : 30441   Mean   :3.127   Mean   : 5.757  
##  3rd Qu.: 39552   3rd Qu.: 36124   3rd Qu.:3.800   3rd Qu.: 6.000  
##  Max.   :192465   Max.   :173560   Max.   :6.000   Max.   :12.000  
##        HP           City.MPG      Highway.MPG        Weight    
##  Min.   : 73.0   Min.   :10.00   Min.   :12.00   Min.   :1850  
##  1st Qu.:165.0   1st Qu.:18.00   1st Qu.:24.00   1st Qu.:3107  
##  Median :210.0   Median :19.00   Median :27.00   Median :3469  
##  Mean   :214.4   Mean   :20.31   Mean   :27.26   Mean   :3532  
##  3rd Qu.:250.0   3rd Qu.:21.50   3rd Qu.:30.00   3rd Qu.:3922  
##  Max.   :493.0   Max.   :60.00   Max.   :66.00   Max.   :6400  
##    Wheel.Base        Length        Width       wheeltype         type    
##  Min.   : 89.0   Min.   :143   Min.   :64.00   AWD: 78   Minivan   : 20  
##  1st Qu.:103.0   1st Qu.:177   1st Qu.:69.00   RWD:309   Pickup    :234  
##  Median :107.0   Median :186   Median :71.00             Sports Car: 45  
##  Mean   :107.2   Mean   :185   Mean   :71.28             SUV       : 59  
##  3rd Qu.:112.0   3rd Qu.:193   3rd Qu.:73.00             Wagon     : 29  
##  Max.   :130.0   Max.   :221   Max.   :81.00
  • Correlation matrix

A correlation matrix is a matrix of pairwise correlation coefficients between variables. The fewer dimensions there are, the smaller and easier to read the correlation matrix becomes.

## 
## Two-Step Estimates
## 
## Correlations/Type of Correlation:
##              Retail.Price Dealer.Cost Engine.Size
## Retail.Price            1     Pearson     Pearson
## Dealer.Cost        0.9991           1     Pearson
## Engine.Size        0.5994      0.5936           1
## 
## Standard Errors:
##              Retail.Price Dealer.Cost
## Retail.Price                         
## Dealer.Cost     8.919e-05            
## Engine.Size       0.03266     0.03301
## 
## n = 387 
## 
## P-values for Tests of Bivariate Normality:
##              Retail.Price Dealer.Cost
## Retail.Price                         
## Dealer.Cost           NaN            
## Engine.Size     2.952e-26   9.253e-26
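
The output above shows Retail.Price and Dealer.Cost correlating at 0.9991, which is exactly the kind of redundancy PCA is meant to absorb. The call that produced that output is not shown, so the sketch below is only an approximation of the idea: it uses base R's cor() (Pearson correlations) on the built-in mtcars data as a stand-in, and the 0.85 cut-off is an arbitrary choice for illustration.

```r
# Stand-in data (assumption): five numeric columns of the built-in mtcars data.
num <- mtcars[, c("mpg", "disp", "hp", "wt", "qsec")]

# Pairwise Pearson correlations
cm <- cor(num)
print(round(cm, 2))

# List each pair once (upper triangle) whose absolute correlation exceeds 0.85
high <- which(abs(cm) > 0.85 & upper.tri(cm), arr.ind = TRUE)
redundant_pairs <- data.frame(
  var1 = rownames(cm)[high[, "row"]],
  var2 = colnames(cm)[high[, "col"]],
  r    = round(cm[high], 3)
)
print(redundant_pairs)
```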
  • PCA with base R’s prcomp()
## Standard deviations (1, .., p=18):
##  [1] 2.663310e+04 6.336531e+02 5.404142e+02 3.388520e+01 1.054945e+01
##  [6] 4.364139e+00 2.699232e+00 1.676887e+00 1.107866e+00 8.460880e-01
## [11] 3.794052e-01 3.141799e-01 2.854633e-01 2.612140e-01 2.452150e-01
## [16] 1.939418e-01 1.468818e-01 5.205969e-16
## 
## Rotation (n x k) = (18 x 18):
##                        PC1           PC2           PC3           PC4
## Sports.Car    4.664256e-06  1.376281e-04 -1.668030e-04  2.135659e-03
## SUV           3.926409e-07 -3.200880e-04  8.640143e-05 -1.330547e-03
## Wagon        -5.584797e-07  1.202712e-05  3.265437e-05 -4.483685e-04
## Minivan      -5.429529e-07 -9.596557e-05  3.046913e-05 -7.436825e-04
## Pickup        0.000000e+00  2.775558e-17 -5.551115e-17  2.220446e-16
## AWD           1.596014e-06 -2.380564e-04  5.065348e-05 -1.526639e-03
## RWD           7.571334e-06  1.070398e-04 -5.763259e-05  2.224008e-03
## Retail.Price  7.404744e-01 -2.381699e-01 -6.284570e-01 -3.694206e-03
## Dealer.Cost   6.719632e-01  2.799511e-01  6.856313e-01  1.364569e-03
## Engine.Size   2.274067e-05 -9.482492e-04  2.307113e-04  7.365899e-03
## Cyl           3.654059e-05 -1.090188e-03  3.431778e-04  1.078132e-02
## HP            2.200871e-03 -2.913309e-02  7.741448e-03  9.975386e-01
## City.MPG     -9.563723e-05  4.568842e-03 -1.660748e-03 -3.709087e-02
## Highway.MPG  -9.904539e-05  5.458085e-03 -2.036162e-03 -3.033492e-02
## Weight        1.257923e-02 -9.293940e-01  3.672072e-01 -3.090327e-02
## Wheel.Base    5.417952e-05 -7.588386e-03  4.129063e-03  1.400151e-02
## Length        1.036728e-04 -1.243324e-02  5.034887e-03  3.470477e-02
## Width         3.921017e-05 -3.930300e-03  9.522236e-04  8.055652e-03
##                        PC5           PC6           PC7           PC8
## Sports.Car   -9.277616e-03  7.043937e-03 -1.699490e-02  5.237911e-02
## SUV          -1.307656e-02  4.383552e-03 -5.742077e-03  2.471423e-03
## Wagon        -1.237310e-03  8.983243e-04 -3.028729e-03 -9.739920e-03
## Minivan       2.308389e-03 -9.682745e-04  2.106389e-02  3.450641e-02
## Pickup       -1.665335e-16 -5.551115e-17 -9.020562e-17 -2.567391e-16
## AWD          -1.418351e-02  5.831884e-03 -1.335326e-02 -2.516106e-02
## RWD           1.209627e-03  6.813950e-03  3.036752e-02  2.543403e-02
## Retail.Price  4.772513e-04 -5.704585e-05  6.755367e-04 -3.696670e-04
## Dealer.Cost  -3.095909e-04  1.397392e-04 -6.841106e-04  4.272107e-04
## Engine.Size   1.267252e-02  9.760385e-03 -1.145745e-02  3.514177e-02
## Cyl           1.611315e-02  1.800829e-02 -5.993707e-03  8.655241e-03
## HP           -3.409354e-02 -5.043907e-02 -1.111277e-02 -5.857513e-03
## City.MPG      2.838682e-02 -7.111340e-01 -1.387846e-01 -2.362369e-02
## Highway.MPG   1.106391e-01 -6.654923e-01 -1.408586e-01  6.085202e-03
## Weight       -1.396935e-02 -6.629515e-03 -4.673530e-03 -2.551353e-03
## Wheel.Base    3.433947e-01 -1.543644e-01  9.250228e-01 -6.601010e-03
## Length        9.264643e-01  1.562734e-01 -3.197702e-01 -9.680310e-02
## Width         9.193759e-02  1.848764e-04 -2.808757e-02  9.916271e-01
##                        PC9          PC10          PC11          PC12
## Sports.Car   -1.586417e-02  5.172408e-03 -1.939059e-01  9.397434e-02
## SUV           8.231795e-02 -1.991433e-02  2.179351e-01  2.824254e-01
## Wagon        -3.619365e-03 -1.357449e-02 -4.658996e-02 -3.436478e-01
## Minivan      -1.814420e-03 -2.876530e-02 -4.953584e-02 -6.357442e-03
## Pickup       -3.478121e-16 -1.265914e-15  7.494005e-16  5.520757e-16
## AWD           1.261696e-02 -8.292133e-02  6.105930e-01 -2.544330e-01
## RWD           3.491230e-03  1.541480e-01 -7.108886e-01 -1.641387e-01
## Retail.Price  8.860378e-06 -5.209333e-05 -7.094491e-06 -3.712337e-05
## Dealer.Cost   1.785525e-06  4.737114e-05  1.656075e-05  4.631813e-05
## Engine.Size   4.662106e-02  4.095714e-01  4.584568e-02  7.600677e-01
## Cyl           7.212737e-02  8.897488e-01  1.602103e-01 -3.488091e-01
## HP            2.032004e-04 -1.140233e-02  2.996188e-04 -1.696706e-03
## City.MPG      6.833208e-01 -4.169812e-02 -3.164487e-02 -2.100510e-02
## Highway.MPG  -7.172857e-01  7.018430e-02  2.716482e-02  1.967286e-02
## Weight       -2.002600e-03 -3.204444e-04 -8.603018e-04 -1.699555e-04
## Wheel.Base    1.478122e-02 -7.773695e-04  2.991902e-02  8.423462e-03
## Length        5.605310e-02 -2.680689e-02 -9.281434e-03 -4.846287e-03
## Width         2.485617e-02 -3.180864e-02  4.004066e-02 -3.600757e-02
##                       PC13          PC14          PC15          PC16
## Sports.Car   -6.826348e-02  1.341929e-02 -4.065804e-01  8.831591e-01
## SUV           1.321703e-01  5.005382e-01  4.773684e-01  2.408485e-01
## Wagon        -4.052110e-01 -5.395164e-01  5.697403e-01  2.658700e-01
## Minivan       8.836167e-02 -3.675634e-01 -4.350037e-01 -2.002194e-01
## Pickup        3.094801e-15  1.514671e-15 -2.868799e-16 -4.713244e-15
## AWD          -6.231493e-01  2.693695e-01 -2.988965e-01 -2.702333e-02
## RWD          -4.348980e-01  4.397817e-01 -7.517149e-03 -1.843286e-01
## Retail.Price  3.173141e-05 -2.100314e-05  2.855642e-05 -1.737747e-05
## Dealer.Cost  -3.305076e-05  2.083567e-05 -2.769418e-05  1.897891e-05
## Engine.Size  -4.271171e-01 -2.309297e-01  1.260485e-02 -9.829235e-02
## Cyl           2.141708e-01  4.291264e-02 -2.723250e-02  7.107806e-02
## HP            8.993003e-04  5.786233e-04  8.280510e-04 -1.471095e-03
## City.MPG     -4.770321e-03 -2.403292e-02 -2.984405e-02 -1.124977e-03
## Highway.MPG  -5.624382e-03  3.508978e-02  2.704478e-02  5.494734e-03
## Weight        2.001988e-04 -5.533337e-05 -1.782101e-04 -1.416688e-05
## Wheel.Base   -5.229849e-03  1.741793e-03  1.767598e-03  2.933672e-02
## Length       -4.071112e-03  8.389702e-03 -5.870570e-03  4.480046e-03
## Width         4.326170e-03  8.945077e-03  3.198558e-02 -3.020290e-02
##                       PC17          PC18
## Sports.Car   -7.837203e-04  4.431569e-15
## SUV          -5.580682e-01  2.003898e-16
## Wagon        -1.699023e-01  3.820757e-15
## Minivan      -7.891813e-01  5.861165e-17
## Pickup        8.300076e-16  1.000000e+00
## AWD          -5.840771e-02  9.998765e-16
## RWD          -1.603820e-01  7.132248e-16
## Retail.Price  1.562904e-05 -6.711892e-19
## Dealer.Cost  -1.607443e-05  1.005849e-18
## Engine.Size   5.493540e-02  1.293071e-15
## Cyl          -3.863841e-02  9.606336e-16
## HP           -1.368756e-03 -1.297282e-16
## City.MPG      2.575204e-02  2.291183e-16
## Highway.MPG  -3.837248e-02 -2.446454e-16
## Weight        1.089482e-04  7.928040e-19
## Wheel.Base    1.381722e-02  1.935772e-16
## Length       -1.107457e-02 -7.725097e-17
## Width         2.809927e-02  4.496021e-17
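
In the output above, the first standard deviation is orders of magnitude larger than the rest, and PC1 loads almost entirely on Retail.Price and Dealer.Cost. That is the pattern one would expect when variables are centred but not standardised, so the variables with the largest raw variance dominate. The call behind the output is not shown, so treat this reading as an assumption. The sketch below uses the built-in mtcars data as a stand-in to show how the scale. argument of prcomp() changes the result, and how to drop a constant column (like Pickup above, which is 0 for every car) before standardising.

```r
# Stand-in data (assumption): the built-in mtcars data, not the cars data.
pca_raw <- prcomp(mtcars, center = TRUE, scale. = FALSE)  # covariance-based
pca_std <- prcomp(mtcars, center = TRUE, scale. = TRUE)   # correlation-based

# Share of variance captured by PC1 in each case
prop_raw <- summary(pca_raw)$importance["Proportion of Variance", 1]
prop_std <- summary(pca_std)$importance["Proportion of Variance", 1]
print(c(unscaled = prop_raw, scaled = prop_std))

# The raw variances explain the difference: a few columns dwarf the others.
print(sort(sapply(mtcars, var), decreasing = TRUE)[1:3])

# A constant column has zero variance; prcomp(..., scale. = TRUE) stops with
# an error on such a column, so remove it first.
with_const <- cbind(mtcars, constant = 0)
keep <- sapply(with_const, function(col) var(col) > 0)
pca_clean <- prcomp(with_const[, keep], center = TRUE, scale. = TRUE)
print(rownames(pca_clean$rotation))
```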
  • PCA with FactoMineR’s PCA()

PCA on the 11 non-binary numeric variables of cars (columns 9 to 19). PCA() generates 2 graphs and, with ncp = 4, keeps the first 4 PCs.

- Summary of the first 100 cars

The nbelements argument of summary() controls how many rows of individuals and variables are printed, which makes it easy to summarise only a subset of the rows in a dataset.

## 
## Call:
## PCA(X = cars[, 9:19], ncp = 4, graph = T) 
## 
## 
## Eigenvalues
##                        Dim.1   Dim.2   Dim.3   Dim.4   Dim.5   Dim.6
## Variance               7.105   1.884   0.850   0.357   0.275   0.198
## % of var.             64.588  17.127   7.725   3.246   2.504   1.799
## Cumulative % of var.  64.588  81.714  89.439  92.685  95.189  96.988
##                        Dim.7   Dim.8   Dim.9  Dim.10  Dim.11
## Variance               0.141   0.087   0.066   0.037   0.001
## % of var.              1.277   0.788   0.604   0.336   0.007
## Cumulative % of var.  98.266  99.053  99.657  99.993 100.000
## 
## Individuals (the 100 first)
##                  Dist    Dim.1    ctr   cos2    Dim.2    ctr   cos2  
## 1            |  4.568 | -4.533  0.747  0.985 | -0.290  0.012  0.004 |
## 2            |  4.977 | -4.791  0.835  0.927 | -0.771  0.082  0.024 |
## 3            |  3.415 | -3.208  0.374  0.882 |  0.614  0.052  0.032 |
## 4            |  3.450 | -3.262  0.387  0.894 |  0.526  0.038  0.023 |
## 5            |  3.367 | -3.159  0.363  0.881 |  0.528  0.038  0.025 |
## 6            |  3.915 | -3.791  0.523  0.937 |  0.276  0.010  0.005 |
## 7            |  3.860 | -3.733  0.507  0.935 |  0.221  0.007  0.003 |
## 8            |  3.694 | -3.647  0.484  0.975 | -0.001  0.000  0.000 |
## 9            |  4.020 | -3.951  0.568  0.966 |  0.066  0.001  0.000 |
## 10           |  3.639 | -3.591  0.469  0.974 | -0.107  0.002  0.001 |
## 11           |  3.594 | -3.547  0.458  0.974 | -0.093  0.001  0.001 |
## 12           |  4.584 | -4.399  0.704  0.921 |  0.252  0.009  0.003 |
## 13           |  5.443 | -4.896  0.872  0.809 |  0.231  0.007  0.002 |
## 14           |  4.416 | -4.203  0.643  0.906 |  0.241  0.008  0.003 |
## 15           |  4.762 | -4.696  0.802  0.972 | -0.387  0.021  0.007 |
## 16           |  4.715 | -4.647  0.785  0.972 | -0.436  0.026  0.009 |
## 17           |  4.687 | -4.621  0.777  0.972 | -0.430  0.025  0.008 |
## 18           |  3.401 | -3.360  0.411  0.976 |  0.354  0.017  0.011 |
## 19           |  3.357 | -3.318  0.400  0.977 |  0.280  0.011  0.007 |
## 20           |  3.324 | -3.288  0.393  0.978 |  0.295  0.012  0.008 |
## 21           |  2.306 | -1.838  0.123  0.635 |  1.192  0.195  0.267 |
## 22           |  4.570 | -4.488  0.733  0.965 | -0.405  0.023  0.008 |
## 23           |  4.442 | -4.324  0.680  0.948 | -0.437  0.026  0.010 |
## 24           |  3.424 | -3.364  0.412  0.965 |  0.357  0.018  0.011 |
## 25           |  3.373 | -3.318  0.400  0.968 |  0.303  0.013  0.008 |
## 26           |  3.336 | -3.285  0.392  0.969 |  0.255  0.009  0.006 |
## 27           |  5.292 | -4.848  0.855  0.839 | -1.249  0.214  0.056 |
## 28           |  3.984 | -3.938  0.564  0.977 |  0.146  0.003  0.001 |
## 29           |  3.903 | -3.856  0.541  0.976 |  0.074  0.001  0.000 |
## 30           |  2.980 | -2.884  0.302  0.937 |  0.479  0.031  0.026 |
## 31           |  3.514 | -3.336  0.405  0.901 |  0.620  0.053  0.031 |
## 32           |  3.412 | -3.246  0.383  0.905 |  0.460  0.029  0.018 |
## 33           |  3.370 | -3.205  0.374  0.905 |  0.387  0.021  0.013 |
## 34           |  3.267 | -3.115  0.353  0.909 |  0.541  0.040  0.027 |
## 35           |  3.225 | -3.075  0.344  0.909 |  0.468  0.030  0.021 |
## 36           |  5.562 | -5.329  1.033  0.918 | -0.980  0.132  0.031 |
## 37           |  3.361 | -3.275  0.390  0.950 | -0.196  0.005  0.003 |
## 38           |  3.311 | -3.230  0.380  0.952 | -0.277  0.011  0.007 |
## 39           |  3.293 | -3.221  0.377  0.957 |  0.406  0.023  0.015 |
## 40           |  3.066 | -2.930  0.312  0.913 |  0.252  0.009  0.007 |
## 41           |  4.592 | -4.327  0.681  0.888 |  0.203  0.006  0.002 |
## 42           |  4.565 | -4.296  0.671  0.885 |  0.170  0.004  0.001 |
## 43           |  4.560 | -4.289  0.669  0.885 |  0.158  0.003  0.001 |
## 44           |  6.267 | -5.987  1.304  0.913 | -0.839  0.096  0.018 |
## 45           |  5.774 | -5.605  1.143  0.942 | -0.875  0.105  0.023 |
## 46           |  6.248 | -5.963  1.293  0.911 | -0.860  0.101  0.019 |
## 47           |  1.478 | -0.232  0.002  0.025 |  1.218  0.203  0.679 |
## 48           |  1.893 | -0.056  0.000  0.001 |  1.519  0.316  0.644 |
## 49           |  2.583 | -2.264  0.186  0.768 |  0.882  0.107  0.117 |
## 50           |  1.320 | -0.689  0.017  0.272 |  0.523  0.038  0.157 |
## 51           |  1.826 | -0.157  0.001  0.007 |  1.431  0.281  0.614 |
## 52           |  2.815 | -2.584  0.243  0.842 | -0.170  0.004  0.004 |
## 53           |  2.716 | -2.475  0.223  0.830 | -0.361  0.018  0.018 |
## 54           |  2.144 | -1.609  0.094  0.563 |  1.147  0.180  0.286 |
## 55           |  1.166 | -0.565  0.012  0.235 |  0.714  0.070  0.375 |
## 56           |  2.229 |  0.297  0.003  0.018 |  1.834  0.461  0.677 |
## 57           |  2.078 | -1.444  0.076  0.483 |  1.158  0.184  0.311 |
## 58           |  2.032 | -1.410  0.072  0.481 |  1.089  0.163  0.287 |
## 59           |  2.987 | -2.674  0.260  0.801 | -0.429  0.025  0.021 |
## 60           |  1.737 | -0.204  0.002  0.014 |  1.458  0.292  0.705 |
## 61           |  1.469 |  0.183  0.001  0.015 |  1.156  0.183  0.619 |
## 62           |  2.609 | -2.262  0.186  0.752 |  0.781  0.084  0.090 |
## 63           |  2.528 | -2.173  0.172  0.739 |  0.680  0.063  0.072 |
## 64           |  4.211 | -4.001  0.582  0.903 |  0.120  0.002  0.001 |
## 65           |  3.347 | -3.218  0.377  0.925 | -0.560  0.043  0.028 |
## 66           |  7.282 | -5.722  1.191  0.617 |  0.210  0.006  0.001 |
## 67           | 11.382 | -8.695  2.750  0.584 | -0.962  0.127  0.007 |
## 68           |  1.409 | -0.669  0.016  0.226 |  0.817  0.092  0.336 |
## 69           |  1.361 | -0.644  0.015  0.224 |  0.772  0.082  0.322 |
## 70           |  1.402 | -0.741  0.020  0.280 |  0.842  0.097  0.361 |
## 71           |  2.369 | -2.163  0.170  0.834 |  0.679  0.063  0.082 |
## 72           |  1.785 | -0.133  0.001  0.006 |  1.465  0.294  0.673 |
## 73           |  4.704 | -4.117  0.616  0.766 | -1.532  0.322  0.106 |
## 74           |  2.054 | -1.236  0.056  0.362 |  1.050  0.151  0.261 |
## 75           |  2.814 | -2.598  0.245  0.852 | -0.192  0.005  0.005 |
## 76           |  2.543 | -2.267  0.187  0.795 |  0.871  0.104  0.117 |
## 77           |  1.270 | -0.694  0.018  0.298 |  0.477  0.031  0.141 |
## 78           |  1.811 |  0.408  0.006  0.051 |  1.447  0.287  0.639 |
## 79           |  2.915 | -2.824  0.290  0.939 |  0.373  0.019  0.016 |
## 80           |  1.350 | -0.743  0.020  0.303 |  0.534  0.039  0.157 |
## 81           |  2.538 | -2.321  0.196  0.836 | -0.247  0.008  0.009 |
## 82           |  1.988 | -1.716  0.107  0.745 |  0.413  0.023  0.043 |
## 83           |  1.590 | -0.803  0.023  0.255 |  1.010  0.140  0.404 |
## 84           |  2.369 | -1.974  0.142  0.694 |  0.978  0.131  0.170 |
## 85           |  0.987 | -0.500  0.009  0.256 |  0.553  0.042  0.314 |
## 86           |  2.401 | -1.762  0.113  0.538 |  1.210  0.201  0.254 |
## 87           |  1.141 | -0.064  0.000  0.003 |  0.772  0.082  0.458 |
## 88           |  9.020 | -6.178  1.388  0.469 |  0.342  0.016  0.001 |
## 89           |  3.468 | -3.326  0.402  0.920 | -0.373  0.019  0.012 |
## 90           |  3.144 | -3.000  0.327  0.910 | -0.588  0.047  0.035 |
## 91           |  5.701 | -4.823  0.846  0.716 | -0.144  0.003  0.001 |
## 92           |  3.498 | -3.287  0.393  0.883 | -0.776  0.083  0.049 |
## 93           |  2.810 | -2.655  0.256  0.892 | -0.654  0.059  0.054 |
## 94           |  1.766 | -1.538  0.086  0.759 | -0.028  0.000  0.000 |
## 95           |  2.248 | -2.036  0.151  0.820 |  0.062  0.001  0.001 |
## 96           |  1.389 | -1.033  0.039  0.553 | -0.242  0.008  0.030 |
## 97           |  1.797 |  0.713  0.019  0.158 |  1.372  0.258  0.583 |
## 98           |  1.464 |  0.265  0.003  0.033 |  1.051  0.152  0.515 |
## 99           |  1.399 |  0.803  0.023  0.329 |  0.731  0.073  0.273 |
## 100          |  1.718 |  0.428  0.007  0.062 |  1.291  0.229  0.565 |
##               Dim.3    ctr   cos2  
## 1            -0.113  0.004  0.001 |
## 2            -0.451  0.062  0.008 |
## 3             0.828  0.209  0.059 |
## 4             0.798  0.194  0.053 |
## 5             0.875  0.233  0.068 |
## 6             0.741  0.167  0.036 |
## 7             0.769  0.180  0.040 |
## 8            -0.132  0.005  0.001 |
## 9             0.280  0.024  0.005 |
## 10           -0.074  0.002  0.000 |
## 11           -0.084  0.002  0.001 |
## 12            1.187  0.428  0.067 |
## 13            2.248  1.536  0.170 |
## 14            1.267  0.489  0.082 |
## 15           -0.201  0.012  0.002 |
## 16           -0.176  0.009  0.001 |
## 17           -0.181  0.010  0.001 |
## 18            0.278  0.023  0.007 |
## 19            0.318  0.031  0.009 |
## 20            0.308  0.029  0.009 |
## 21           -0.054  0.001  0.001 |
## 22           -0.573  0.100  0.016 |
## 23           -0.767  0.179  0.030 |
## 24           -0.278  0.024  0.007 |
## 25           -0.249  0.019  0.005 |
## 26           -0.223  0.015  0.004 |
## 27           -0.309  0.029  0.003 |
## 28            0.424  0.055  0.011 |
## 29            0.461  0.065  0.014 |
## 30            0.143  0.006  0.002 |
## 31            0.471  0.067  0.018 |
## 32            0.558  0.095  0.027 |
## 33            0.598  0.109  0.031 |
## 34            0.589  0.105  0.032 |
## 35            0.628  0.120  0.038 |
## 36            0.262  0.021  0.002 |
## 37           -0.500  0.076  0.022 |
## 38           -0.456  0.063  0.019 |
## 39           -0.370  0.042  0.013 |
## 40           -0.599  0.109  0.038 |
## 41            1.474  0.661  0.103 |
## 42            1.491  0.676  0.107 |
## 43            1.498  0.682  0.108 |
## 44            1.270  0.491  0.041 |
## 45            0.654  0.130  0.013 |
## 46            1.281  0.499  0.042 |
## 47            0.342  0.036  0.054 |
## 48            0.856  0.223  0.205 |
## 49            0.564  0.097  0.048 |
## 50            0.180  0.010  0.019 |
## 51            0.822  0.205  0.203 |
## 52           -0.852  0.221  0.092 |
## 53           -0.748  0.170  0.076 |
## 54            0.135  0.006  0.004 |
## 55            0.062  0.001  0.003 |
## 56            0.764  0.178  0.118 |
## 57           -0.189  0.011  0.008 |
## 58           -0.151  0.007  0.006 |
## 59           -0.998  0.303  0.112 |
## 60            0.037  0.000  0.000 |
## 61           -0.063  0.001  0.002 |
## 62            0.828  0.209  0.101 |
## 63            0.882  0.237  0.122 |
## 64            1.210  0.445  0.083 |
## 65           -0.415  0.052  0.015 |
## 66            4.115  5.150  0.319 |
## 67            6.361 12.306  0.312 |
## 68           -0.505  0.077  0.128 |
## 69           -0.480  0.070  0.124 |
## 70           -0.461  0.065  0.108 |
## 71            0.338  0.035  0.020 |
## 72            0.119  0.004  0.004 |
## 73           -0.802  0.196  0.029 |
## 74           -0.259  0.020  0.016 |
## 75           -0.670  0.137  0.057 |
## 76            0.394  0.047  0.024 |
## 77           -0.093  0.003  0.005 |
## 78            0.567  0.098  0.098 |
## 79            0.201  0.012  0.005 |
## 80           -0.170  0.009  0.016 |
## 81           -0.836  0.212  0.108 |
## 82           -0.530  0.085  0.071 |
## 83           -0.463  0.065  0.085 |
## 84            0.592  0.107  0.062 |
## 85            0.104  0.003  0.011 |
## 86            0.712  0.154  0.088 |
## 87            0.126  0.005  0.012 |
## 88            5.561  9.403  0.380 |
## 89           -0.670  0.136  0.037 |
## 90           -0.513  0.080  0.027 |
## 91            2.558  1.990  0.201 |
## 92           -0.672  0.137  0.037 |
## 93           -0.173  0.009  0.004 |
## 94           -0.098  0.003  0.003 |
## 95           -0.095  0.003  0.002 |
## 96           -0.270  0.022  0.038 |
## 97            0.648  0.128  0.130 |
## 98            0.482  0.071  0.108 |
## 99            0.188  0.011  0.018 |
## 100           0.661  0.133  0.148 |
## 
## Variables
##                 Dim.1    ctr   cos2    Dim.2    ctr   cos2    Dim.3    ctr
## Retail.Price |  0.703  6.956  0.494 | -0.643 21.950  0.414 |  0.235  6.501
## Dealer.Cost  |  0.699  6.881  0.489 | -0.645 22.104  0.416 |  0.237  6.618
## Engine.Size  |  0.925 12.046  0.856 |  0.021  0.024  0.000 |  0.044  0.223
## Cyl          |  0.891 11.168  0.793 | -0.107  0.609  0.011 |  0.075  0.663
## HP           |  0.849 10.151  0.721 | -0.401  8.539  0.161 |  0.070  0.583
## City.MPG     | -0.828  9.640  0.685 |  0.005  0.001  0.000 |  0.493 28.629
## Highway.MPG  | -0.817  9.400  0.668 |  0.015  0.012  0.000 |  0.552 35.880
## Weight       |  0.896 11.312  0.804 |  0.230  2.804  0.053 | -0.103  1.259
## Wheel.Base   |  0.710  7.087  0.503 |  0.574 17.487  0.329 |  0.244  6.994
## Length       |  0.684  6.594  0.468 |  0.561 16.680  0.314 |  0.318 11.882
## Width        |  0.789  8.765  0.623 |  0.429  9.790  0.184 |  0.081  0.767
##                cos2  
## Retail.Price  0.055 |
## Dealer.Cost   0.056 |
## Engine.Size   0.002 |
## Cyl           0.006 |
## HP            0.005 |
## City.MPG      0.243 |
## Highway.MPG   0.305 |
## Weight        0.011 |
## Wheel.Base    0.059 |
## Length        0.101 |
## Width         0.007 |
  • Variance and cumulative variance of the first 3 new dimensions
##           eigenvalue percentage of variance
## comp 1  7.1046384308           64.587622098
## comp 2  1.8839247679           17.126588799
## comp 3  0.8497282852            7.724802592
## comp 4  0.3570154894            3.245595359
## comp 5  0.2754355932            2.503959939
## comp 6  0.1979437155            1.799488322
## comp 7  0.1405192086            1.277447350
## comp 8  0.0866388119            0.787625563
## comp 9  0.0663879807            0.603527097
## comp 10 0.0369773622            0.336157838
## comp 11 0.0007903547            0.007185043
##         cumulative percentage of variance
## comp 1                           64.58762
## comp 2                           81.71421
## comp 3                           89.43901
## comp 4                           92.68461
## comp 5                           95.18857
## comp 6                           96.98806
## comp 7                           98.26550
## comp 8                           99.05313
## comp 9                           99.65666
## comp 10                          99.99281
## comp 11                         100.00000
##                  Dim.1        Dim.2       Dim.3       Dim.4
## Retail.Price 0.4942292 4.135222e-01 0.055242376 0.027966771
## Dealer.Cost  0.4888778 4.164186e-01 0.056233118 0.029554656
## Engine.Size  0.8558593 4.437324e-04 0.001892595 0.098541461
## Cyl          0.7934611 1.147121e-02 0.005637083 0.146141962
## HP           0.7211734 1.608659e-01 0.004957347 0.001215251
## City.MPG     0.6848793 2.134397e-05 0.243270722 0.012387933
## Highway.MPG  0.6678118 2.264843e-04 0.304880981 0.005645006
## Weight       0.8036585 5.283288e-02 0.010700651 0.005102206
## Wheel.Base   0.5034900 3.294459e-01 0.059426280 0.017390311
## Length       0.4684884 3.142384e-01 0.100967748 0.010135761
## Width        0.6227096 1.844381e-01 0.006519384 0.002934173
##                  Dim.1        Dim.2      Dim.3      Dim.4
## Retail.Price  6.956430 21.950039964  6.5011813  7.8334894
## Dealer.Cost   6.881107 22.103781152  6.6177765  8.2782559
## Engine.Size  12.046487  0.023553613  0.2227294 27.6014525
## Cyl          11.168213  0.608899472  0.6633983 40.9343478
## HP           10.150740  8.538871564  0.5834038  0.3403916
## City.MPG      9.639890  0.001132952 28.6292367  3.4698587
## Highway.MPG   9.399659  0.012021939 35.8798202  1.5811655
## Weight       11.311744  2.804404780  1.2593026  1.4291273
## Wheel.Base    7.086778 17.487209278  6.9935627  4.8710243
## Length        6.594120 16.679985586 11.8823570  2.8390255
## Width         8.764832  9.790099701  0.7672316  0.8218616
## $Dim.1
## $Dim.1$quanti
##              correlation       p.value
## Engine.Size    0.9251267 5.116179e-164
## Weight         0.8964700 3.638599e-138
## Cyl            0.8907643 6.262131e-134
## HP             0.8492193 8.059693e-109
## Width          0.7891195  1.664116e-83
## Wheel.Base     0.7095703  1.671006e-60
## Retail.Price   0.7030143  5.914584e-59
## Dealer.Cost    0.6991979  4.510056e-58
## Length         0.6844621  8.580649e-55
## Highway.MPG   -0.8171975  3.651747e-94
## City.MPG      -0.8275744  1.404114e-98
## 
## 
## $Dim.2
## $Dim.2$quanti
##              correlation      p.value
## Wheel.Base     0.5739738 2.730836e-35
## Length         0.5605697 2.095168e-33
## Width          0.4294626 8.445613e-19
## Weight         0.2298540 4.913548e-06
## Cyl           -0.1071037 3.518485e-02
## HP            -0.4010809 2.175520e-16
## Retail.Price  -0.6430569 1.540011e-46
## Dealer.Cost   -0.6453051 5.917390e-47
## 
## 
## $Dim.3
## $Dim.3$quanti
##              correlation      p.value
## Highway.MPG    0.5521603 2.888889e-32
## City.MPG       0.4932248 4.059874e-25
## Length         0.3177542 1.581611e-10
## Wheel.Base     0.2437751 1.212886e-06
## Dealer.Cost    0.2371352 2.389210e-06
## Retail.Price   0.2350370 2.948038e-06
## Weight        -0.1034439 4.196587e-02
##    comp 1    comp 2    comp 3 
## 64.587622 17.126589  7.724803
##   comp 1   comp 2   comp 3 
## 64.58762 81.71421 89.43901
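
The tables above are tied together by simple arithmetic, which can be checked directly from the printed numbers. The sketch below copies the eigenvalues and the Dim.1 cos2 values from the output above; the object names are chosen for illustration.

```r
# Eigenvalues copied from the eigenvalue table above
eig <- c(7.1046384308, 1.8839247679, 0.8497282852, 0.3570154894,
         0.2754355932, 0.1979437155, 0.1405192086, 0.0866388119,
         0.0663879807, 0.0369773622, 0.0007903547)

# With standardised variables the eigenvalues sum to the number of variables
# (11), so each percentage of variance is eigenvalue / 11 * 100.
pct <- eig / sum(eig) * 100
cum <- cumsum(pct)
print(round(pct[1:3], 3))
print(round(cum[1:3], 3))

# cos2 of each variable on Dim.1, copied from the cos2 table above
cos2_dim1 <- c(Retail.Price = 0.4942292, Dealer.Cost = 0.4888778,
               Engine.Size  = 0.8558593, Cyl         = 0.7934611,
               HP           = 0.7211734, City.MPG    = 0.6848793,
               Highway.MPG  = 0.6678118, Weight      = 0.8036585,
               Wheel.Base   = 0.5034900, Length      = 0.4684884,
               Width        = 0.6227096)

# The cos2 values of a dimension add up to its eigenvalue, and a variable's
# contribution (in %) is its share of that total.
print(sum(cos2_dim1))
contrib_dim1 <- cos2_dim1 / sum(cos2_dim1) * 100
print(round(contrib_dim1, 3))
```

The recomputed percentages and contributions agree with the printed tables up to rounding.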
  • PCA with active and supplementary variables

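PCA() lets you declare quantitative and qualitative supplementary variables through its quanti.sup and quali.sup arguments. Supplementary variables take no part in building the components; they are only projected onto them afterwards.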
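
A minimal sketch of supplementary variables, assuming the FactoMineR package is installed. It uses the built-in mtcars data as a stand-in for the cars data, and the choice of which columns are supplementary is purely illustrative.

```r
# Stand-in data (assumption): a few columns of the built-in mtcars data.
df <- mtcars[, c("mpg", "disp", "hp", "wt", "qsec", "am")]
df$am <- factor(df$am, labels = c("automatic", "manual"))

if (requireNamespace("FactoMineR", quietly = TRUE)) {
  # Column 5 (qsec) is a quantitative supplementary variable and
  # column 6 (am) a qualitative one; columns 1 to 4 are active.
  res <- FactoMineR::PCA(df, quanti.sup = 5, quali.sup = 6, graph = FALSE)

  print(rownames(res$var$coord))  # active variables only
  print(res$quanti.sup$coord)     # projection of the supplementary variable
  print(res$quali.sup$coord)      # position of each category of am
} else {
  message("FactoMineR is not installed; skipping the example.")
}
```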

  • PCA using the 11 non-binary numeric variables

dudi.pca() is the main function that implements PCA in the ade4 package. By default it runs interactively, showing a barplot of the eigenvalues and asking how many axes to keep. To suppress the interactive mode, set the scannf argument to FALSE and pass the number of axes to retain through the nf argument.

## Class: pca dudi
## Call: dudi.pca(df = cars[, 8:18], scannf = FALSE, nf = 4)
## 
## Total inertia: 11
## 
## Eigenvalues:
##     Ax1     Ax2     Ax3     Ax4     Ax5 
##  7.1046  1.8839  0.8497  0.3570  0.2754 
## 
## Projected inertia (%):
##     Ax1     Ax2     Ax3     Ax4     Ax5 
##  64.588  17.127   7.725   3.246   2.504 
## 
## Cumulative projected inertia (%):
##     Ax1   Ax1:2   Ax1:3   Ax1:4   Ax1:5 
##   64.59   81.71   89.44   92.68   95.19 
## 
## (Only 5 dimensions (out of 11) are shown)
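
A minimal sketch of the non-interactive call, assuming the ade4 package is installed. It uses the built-in mtcars data as a stand-in for the cars data, and the column selection is illustrative only.

```r
# Stand-in data (assumption): five numeric columns of the built-in mtcars data.
df <- mtcars[, c("mpg", "disp", "hp", "wt", "qsec")]

if (requireNamespace("ade4", quietly = TRUE)) {
  # scannf = FALSE skips the interactive prompt; nf sets the axes to keep.
  pca_dudi <- ade4::dudi.pca(df, center = TRUE, scale = TRUE,
                             scannf = FALSE, nf = 4)

  print(pca_dudi$eig)        # all eigenvalues
  print(sum(pca_dudi$eig))   # total inertia: the number of scaled variables
  print(head(pca_dudi$li))   # coordinates of the rows on the kept axes
  print(pca_dudi$co)         # coordinates of the variables on the kept axes
} else {
  message("ade4 is not installed; skipping the example.")
}
```

With centred and scaled variables the total inertia equals the number of variables, which is why the output above reports a total inertia of 11.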
## 
## Call:
## PCA(X = cars[, 9:19], ncp = 4, graph = T) 
## 
## 
## Eigenvalues
##                        Dim.1   Dim.2   Dim.3   Dim.4   Dim.5   Dim.6
## Variance               7.105   1.884   0.850   0.357   0.275   0.198
## % of var.             64.588  17.127   7.725   3.246   2.504   1.799
## Cumulative % of var.  64.588  81.714  89.439  92.685  95.189  96.988
##                        Dim.7   Dim.8   Dim.9  Dim.10  Dim.11
## Variance               0.141   0.087   0.066   0.037   0.001
## % of var.              1.277   0.788   0.604   0.336   0.007
## Cumulative % of var.  98.266  99.053  99.657  99.993 100.000
## 
## Individuals (the 10 first)
##                  Dist    Dim.1    ctr   cos2    Dim.2    ctr   cos2  
## 1            |  4.568 | -4.533  0.747  0.985 | -0.290  0.012  0.004 |
## 2            |  4.977 | -4.791  0.835  0.927 | -0.771  0.082  0.024 |
## 3            |  3.415 | -3.208  0.374  0.882 |  0.614  0.052  0.032 |
## 4            |  3.450 | -3.262  0.387  0.894 |  0.526  0.038  0.023 |
## 5            |  3.367 | -3.159  0.363  0.881 |  0.528  0.038  0.025 |
## 6            |  3.915 | -3.791  0.523  0.937 |  0.276  0.010  0.005 |
## 7            |  3.860 | -3.733  0.507  0.935 |  0.221  0.007  0.003 |
## 8            |  3.694 | -3.647  0.484  0.975 | -0.001  0.000  0.000 |
## 9            |  4.020 | -3.951  0.568  0.966 |  0.066  0.001  0.000 |
## 10           |  3.639 | -3.591  0.469  0.974 | -0.107  0.002  0.001 |
##               Dim.3    ctr   cos2  
## 1            -0.113  0.004  0.001 |
## 2            -0.451  0.062  0.008 |
## 3             0.828  0.209  0.059 |
## 4             0.798  0.194  0.053 |
## 5             0.875  0.233  0.068 |
## 6             0.741  0.167  0.036 |
## 7             0.769  0.180  0.040 |
## 8            -0.132  0.005  0.001 |
## 9             0.280  0.024  0.005 |
## 10           -0.074  0.002  0.000 |
## 
## Variables (the 10 first)
##                 Dim.1    ctr   cos2    Dim.2    ctr   cos2    Dim.3    ctr
## Retail.Price |  0.703  6.956  0.494 | -0.643 21.950  0.414 |  0.235  6.501
## Dealer.Cost  |  0.699  6.881  0.489 | -0.645 22.104  0.416 |  0.237  6.618
## Engine.Size  |  0.925 12.046  0.856 |  0.021  0.024  0.000 |  0.044  0.223
## Cyl          |  0.891 11.168  0.793 | -0.107  0.609  0.011 |  0.075  0.663
## HP           |  0.849 10.151  0.721 | -0.401  8.539  0.161 |  0.070  0.583
## City.MPG     | -0.828  9.640  0.685 |  0.005  0.001  0.000 |  0.493 28.629
## Highway.MPG  | -0.817  9.400  0.668 |  0.015  0.012  0.000 |  0.552 35.880
## Weight       |  0.896 11.312  0.804 |  0.230  2.804  0.053 | -0.103  1.259
## Wheel.Base   |  0.710  7.087  0.503 |  0.574 17.487  0.329 |  0.244  6.994
## Length       |  0.684  6.594  0.468 |  0.561 16.680  0.314 |  0.318 11.882
##                cos2  
## Retail.Price  0.055 |
## Dealer.Cost   0.056 |
## Engine.Size   0.002 |
## Cyl           0.006 |
## HP            0.005 |
## City.MPG      0.243 |
## Highway.MPG   0.305 |
## Weight        0.011 |
## Wheel.Base    0.059 |
## Length        0.101 |
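
The cos2 and ctr columns of the variables table are linked by simple arithmetic. As a minimal sketch, using a few values copied by hand from the first dimension of the output above: cos2 is the squared coordinate, and ctr is cos2 divided by the eigenvalue of the dimension, expressed as a percentage. Because the printed coordinates are rounded, the results match the table only up to rounding.

```r
# Values copied from the FactoMineR output above (first dimension only).
eig_dim1 <- 7.105
coord <- c(Retail.Price = 0.703, Engine.Size = 0.925, City.MPG = -0.828)

# cos2: squared correlation between the variable and the component.
cos2 <- coord^2

# ctr: share (in %) of the component's variance due to each variable.
ctr <- 100 * cos2 / eig_dim1

round(cbind(coord, cos2, ctr), 3)
```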
  • Factor map for the variables

  • Factor map for the individual observations

  • Barplot for the variables with the highest cos2 in the 1st PC

  • Barplot for the variables with the highest cos2 in the 2nd PC

  • Factor map for the top 5 variables with the highest contributions

The following plots identify the variables' contributions to the extracted principal components.

  • Factor map for the top 5 individuals with the highest contributions

  • Barplot for the variables with the highest contributions to the 1st PC

  • Barplot for the variables with the highest contributions to the 2nd PC

  • Biplot with no labels for all individuals with the geom argument

  • Ellipsoids for wheeltype columns

  • Biplot with ellipsoids

Advanced PCA & Non-negative matrix factorization (NNMF)

In this section I will cover how to deal with missing data using a dimensionality reduction technique called Non-negative matrix factorization (NNMF). This section will cover:

How many PCs to retain?

  • Kaiser-Guttman rule: keep the PCs with eigenvalue > 1
  • Scree test (constructing the screeplot): look for the elbow
  • Parallel analysis

Data

The data used for this analysis is the airquality dataset from the datasets package. This dataset contains daily air quality measurements in New York from May to September 1973. The data consist of 153 observations and 6 variables.

  • Ozone (numeric): Ozone (ppb)
  • Solar.R (numeric): Solar R (lang)
  • Wind (numeric): Wind (mph)
  • Temp (numeric): Temperature (degrees F)
  • Month (numeric): Month (1–12)
  • Day (numeric): Day of month (1–31)

PCA

  • PCA on the airquality dataset

  • Kaiser-Guttman rule
## 
## Call:
## PCA(X = airquality) 
## 
## 
## Eigenvalues
##                        Dim.1   Dim.2   Dim.3   Dim.4   Dim.5   Dim.6
## Variance               2.318   1.165   0.983   0.790   0.435   0.310
## % of var.             38.625  19.411  16.385  13.175   7.246   5.158
## Cumulative % of var.  38.625  58.036  74.421  87.596  94.842 100.000
## 
## Individuals (the 10 first)
##             Dist    Dim.1    ctr   cos2    Dim.2    ctr   cos2    Dim.3
## 1       |  2.582 | -0.570  0.092  0.049 | -1.539  1.329  0.355 | -0.229
## 2       |  2.404 | -0.663  0.124  0.076 | -0.922  0.477  0.147 | -0.437
## 3       |  2.473 | -1.536  0.665  0.386 | -1.246  0.871  0.254 | -0.834
## 4       |  3.101 | -1.536  0.665  0.245 | -2.467  3.416  0.633 | -0.148
## 5       |  3.225 | -2.191  1.354  0.462 | -1.668  1.561  0.267 | -0.136
## 6       |  2.653 | -1.948  1.071  0.540 | -1.549  1.346  0.341 | -0.368
## 7       |  2.667 | -0.947  0.253  0.126 | -2.050  2.358  0.591 |  0.257
## 8       |  3.101 | -2.668  2.008  0.741 | -0.737  0.305  0.057 | -0.302
## 9       |  4.380 | -3.841  4.161  0.769 | -0.329  0.061  0.006 | -0.874
## 10      |  1.863 | -0.679  0.130  0.133 | -1.106  0.687  0.353 |  0.455
##            ctr   cos2    Dim.4    ctr   cos2  
## 1        0.035  0.008 | -1.861  2.863  0.519 |
## 2        0.127  0.033 | -2.072  3.551  0.743 |
## 3        0.463  0.114 | -1.001  0.828  0.164 |
## 4        0.015  0.002 | -0.318  0.084  0.011 |
## 5        0.012  0.002 | -0.782  0.506  0.059 |
## 6        0.090  0.019 | -0.445  0.163  0.028 |
## 7        0.044  0.009 | -0.696  0.400  0.068 |
## 8        0.060  0.009 | -1.140  1.074  0.135 |
## 9        0.508  0.040 | -0.526  0.229  0.014 |
## 10       0.137  0.060 | -1.207  1.205  0.420 |
## 
## Variables
##            Dim.1    ctr   cos2    Dim.2    ctr   cos2    Dim.3    ctr
## Ozone   |  0.828 29.610  0.686 | -0.078  0.517  0.006 |  0.295  8.877
## Solar.R |  0.385  6.402  0.148 | -0.720 44.559  0.519 |  0.167  2.821
## Wind    | -0.715 22.029  0.511 | -0.178  2.719  0.032 | -0.200  4.072
## Temp    |  0.866 32.341  0.750 |  0.056  0.267  0.003 | -0.126  1.612
## Month   |  0.447  8.608  0.199 |  0.558 26.725  0.311 | -0.514 26.881
## Day     | -0.153  1.010  0.023 |  0.542 25.212  0.294 |  0.740 55.737
##           cos2    Dim.4    ctr   cos2  
## Ozone    0.087 | -0.082  0.854  0.007 |
## Solar.R  0.028 |  0.481 29.245  0.231 |
## Wind     0.040 |  0.493 30.782  0.243 |
## Temp     0.016 |  0.127  2.039  0.016 |
## Month    0.264 |  0.404 20.665  0.163 |
## Day      0.548 |  0.360 16.416  0.130 |
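
The Kaiser-Guttman rule can be checked directly from the eigenvalues of the correlation matrix. The sketch below uses base R only and, as an assumption, drops incomplete rows (listwise deletion), so its eigenvalues can differ slightly from the FactoMineR output above.

```r
# Correlation matrix from the complete rows only (listwise deletion),
# so the eigenvalues can differ slightly from the output above.
r <- cor(airquality, use = "complete.obs")
eigenvalues <- eigen(r, symmetric = TRUE, only.values = TRUE)$values

# Kaiser-Guttman rule: keep components whose eigenvalue exceeds 1.
n_keep <- sum(eigenvalues > 1)

round(eigenvalues, 3)
n_keep
```

In the output above, two eigenvalues (2.318 and 1.165) exceed 1, so the rule keeps two components there.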

Screeplot test

Parallel analysis

  • Parallel analysis with paran()
## 
## Using eigendecomposition of correlation matrix.
## Computing: 10%  20%  30%  40%  50%  60%  70%  80%  90%  100%
## 
## 
## Results of Horn's Parallel Analysis for component retention
## 180 iterations, using the mean estimate
## 
## -------------------------------------------------- 
## Component   Adjusted    Unadjusted    Estimated 
##             Eigenvalue  Eigenvalue    Bias 
## -------------------------------------------------- 
## 1           2.132182    2.468840      0.336658
## -------------------------------------------------- 
## 
## Adjusted eigenvalues > 1 indicate dimensions to retain.
## (1 components retained)
  • Suggested number of PCs to retain using parallel analysis
## [1] 1
  • Parallel analysis with fa.parallel()

## Parallel analysis suggests that the number of factors =  3  and the number of components =  1
  • Suggested number of PCs to retain using parallel analysis
## [1] 1
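
The logic behind the adjusted eigenvalues reported above can be sketched in base R. This is a simplified illustration of Horn's parallel analysis, not the paran() or fa.parallel() implementation: the iteration count, the seed and the use of complete rows only are assumptions made for the example. Observed eigenvalues are compared with the mean eigenvalues of uncorrelated random data of the same size, and the bias (mean random eigenvalue minus 1) is subtracted.

```r
set.seed(1)  # arbitrary seed, only for reproducibility

x <- na.omit(airquality)  # assumption: drop incomplete rows
n <- nrow(x)
p <- ncol(x)

observed <- eigen(cor(x), symmetric = TRUE, only.values = TRUE)$values

# Eigenvalues of correlation matrices of uncorrelated normal data
# with the same number of rows and columns as x.
n_iter <- 200  # hypothetical iteration count
random <- replicate(n_iter, {
  sim <- matrix(rnorm(n * p), nrow = n, ncol = p)
  eigen(cor(sim), symmetric = TRUE, only.values = TRUE)$values
})

# Bias: how far the mean random eigenvalue sits above (or below) 1.
bias <- rowMeans(random) - 1
adjusted <- observed - bias

# Components whose adjusted eigenvalue still exceeds 1.
n_retain <- sum(adjusted > 1)

round(cbind(observed, bias, adjusted), 3)
n_retain
```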

Missing values

Methods for estimating missing values before PCA:

  • Impute the missing values based on mean of the variable that includes NA values
  • Impute the missing values based on a linear regression model
  • Estimate missing values with PCA
  • missMDA and then FactoMineR
  • pcaMethods

Mean imputation is problematic because it will distort the distribution of the variables if the data has a lot of missing values.
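
A quick way to see the distortion is to compare the spread of a variable before and after mean imputation. The sketch below uses the Ozone column of airquality; the variable names are only illustrative.

```r
ozone <- airquality$Ozone

# Replace every missing value with the mean of the observed values.
ozone_mean_imputed <- ifelse(is.na(ozone), mean(ozone, na.rm = TRUE), ozone)

# The mean is unchanged, but the standard deviation shrinks because
# every imputed value sits exactly at the centre of the distribution.
sd_observed <- sd(ozone, na.rm = TRUE)
sd_imputed <- sd(ozone_mean_imputed)

c(observed = sd_observed, mean_imputed = sd_imputed)
```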

  • Determine whether there is missing data
##      Ozone           Solar.R           Wind             Temp      
##  Min.   :  1.00   Min.   :  7.0   Min.   : 1.700   Min.   :56.00  
##  1st Qu.: 18.00   1st Qu.:115.8   1st Qu.: 7.400   1st Qu.:72.00  
##  Median : 31.50   Median :205.0   Median : 9.700   Median :79.00  
##  Mean   : 42.13   Mean   :185.9   Mean   : 9.958   Mean   :77.88  
##  3rd Qu.: 63.25   3rd Qu.:258.8   3rd Qu.:11.500   3rd Qu.:85.00  
##  Max.   :168.00   Max.   :334.0   Max.   :20.700   Max.   :97.00  
##  NA's   :37       NA's   :7                                       
##      Month            Day      
##  Min.   :5.000   Min.   : 1.0  
##  1st Qu.:6.000   1st Qu.: 8.0  
##  Median :7.000   Median :16.0  
##  Mean   :6.993   Mean   :15.8  
##  3rd Qu.:8.000   3rd Qu.:23.0  
##  Max.   :9.000   Max.   :31.0  
## 
  • The number of cells with missing values
## [1] 44
  • The number of rows with missing values
## [1] 42
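
One way to obtain these two counts in base R is sketched below; the variable names are illustrative.

```r
# Cells: count every NA in the data frame.
n_missing_cells <- sum(is.na(airquality))

# Rows: count the rows that have at least one NA.
n_missing_rows <- sum(!complete.cases(airquality))

# Missing values per column, to see where they come from.
colSums(is.na(airquality))

n_missing_cells
n_missing_rows
```

The row count is smaller than the cell count because two rows are missing both Ozone and Solar.R.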

pca()

pca():

  • Uses regression methods for approximation of the correlation matrix
  • Compiles PCA models
  • Projects the new points back into the original space

  • Estimate the optimal number of dimensions for imputation
## $ncp
## [1] 0
## 
## $criterion
##        0        1        2        3        4        5 
## 1520.506 1823.946 1771.702 2774.323 2888.306 6369.592
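
The idea of estimating missing values with PCA can be sketched as an iterative loop: start from mean imputation, fit a low-rank PCA reconstruction, overwrite only the missing cells with the reconstructed values, and repeat until the imputed values stop changing. The function below is a simplified, unregularized illustration in base R and not the missMDA or pcaMethods implementation; impute_pca() is a hypothetical helper, and ncp = 2 is an arbitrary choice for the example rather than the value estimated above.

```r
# Hypothetical helper: iterative PCA imputation, base R only.
impute_pca <- function(x, ncp = 2, max_iter = 200, tol = 1e-6) {
  x <- as.matrix(x)
  miss <- is.na(x)

  # Step 0: start from mean imputation.
  filled <- x
  filled[miss] <- colMeans(x, na.rm = TRUE)[col(x)[miss]]

  for (iter in seq_len(max_iter)) {
    # Standardize with the current column means and standard deviations.
    mu <- colMeans(filled)
    sigma <- apply(filled, 2, sd)
    z <- sweep(sweep(filled, 2, mu, "-"), 2, sigma, "/")

    # Rank-ncp reconstruction from the singular value decomposition.
    s <- svd(z, nu = ncp, nv = ncp)
    z_hat <- s$u %*% diag(s$d[seq_len(ncp)], nrow = ncp) %*% t(s$v)

    # Project the reconstruction back into the original units.
    x_hat <- sweep(sweep(z_hat, 2, sigma, "*"), 2, mu, "+")

    # Update only the cells that were missing; observed cells never change.
    change <- sum((x_hat[miss] - filled[miss])^2)
    filled[miss] <- x_hat[miss]
    if (change < tol) break
  }
  filled
}

aq_filled <- impute_pca(airquality, ncp = 2)
head(round(aq_filled, 1))
```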

The dataset contains 2,225 articles from the BBC news website corresponding to stories in five topical areas from the years 2004-2005. Each article is labeled with one of the following five classes: business, entertainment, politics, sport, and tech.

## [1] 86
## [1] 41
## $ncp
## [1] 4
## 
## $criterion
##        0        1        2        3        4        5 
## 63691268 42181039  7984657  5168754  2157407  2498253

Exploratory factor analysis (EFA)

Exploratory factor analysis (EFA) is a dimensionality reduction technique that is a natural extension of PCA. EFA is suggested instead of PCA when the variables are of ordinal type.

Data

hsq contains the Humor Styles Questionnaire [HSQ] dataset, which includes responses from 1071 participants on 32 questions. The polychoric correlation was calculated using the mixedCor() function of the psych package.

EFA

  • Dimensionality of hsq
## [1] 1071   39
  • Correlation object hsq_correl exploration
## List of 6
##  $ rho  : num [1:32, 1:32] 1 -0.2094 -0.1772 -0.0945 -0.4466 ...
##   ..- attr(*, "dimnames")=List of 2
##   .. ..$ : chr [1:32] "Q1" "Q2" "Q3" "Q4" ...
##   .. ..$ : chr [1:32] "Q1" "Q2" "Q3" "Q4" ...
##  $ rx   : NULL
##  $ poly :List of 4
##   ..$ rho  : num [1:32, 1:32] 1 -0.2094 -0.1772 -0.0945 -0.4466 ...
##   .. ..- attr(*, "dimnames")=List of 2
##   .. .. ..$ : chr [1:32] "Q1" "Q2" "Q3" "Q4" ...
##   .. .. ..$ : chr [1:32] "Q1" "Q2" "Q3" "Q4" ...
##   ..$ tau  : num [1:32, 1:6] -2.77 -2.77 -2.9 -3.11 -2.9 ...
##   .. ..- attr(*, "dimnames")=List of 2
##   .. .. ..$ : chr [1:32] "Q1" "Q2" "Q3" "Q4" ...
##   .. .. ..$ : chr [1:6] "1" "2" "3" "4" ...
##   ..$ n.obs: int 1071
##   ..$ Call : language polychoric(x = data[, p], smooth = smooth, global = global, weight = weight,      correct = correct)
##   ..- attr(*, "class")= chr [1:2] "psych" "poly"
##  $ tetra:List of 2
##   ..$ rho: NULL
##   ..$ tau: NULL
##  $ rpd  : NULL
##  $ Call : language mixedCor(data = hsq, c = NULL, p = 1:32)
##  - attr(*, "class")= chr [1:2] "psych" "mixed"
  • Correlation matrix of the dataset
hsq_polychoric <- hsq_correl$rho
  • Correlation structure exploration

  • The Bartlett sphericity test

H0: There is no significant difference between the correlation matrix and the identity matrix of the same dimensionality.

H1: There is a significant difference between them and, thus, we have strong evidence that there are underlying factors.

EFA is suitable when the p-value of the Bartlett sphericity test is less than 0.05 (statistically significant).

## $chisq
## [1] 1114.409
## 
## $p.value
## [1] 1.610583e-49
## 
## $df
## [1] 496
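
The statistic behind this test is short enough to write out. The sketch below is a base R illustration of the usual formula, with chi-square = -(n - 1 - (2p + 5) / 6) * log(det(R)) and p(p - 1) / 2 degrees of freedom. bartlett_sphericity() is a hypothetical helper, and mtcars is used as a stand-in because the hsq data is not loaded here.

```r
# Hypothetical helper: Bartlett's test of sphericity.
# r: correlation matrix, n: number of observations behind it.
bartlett_sphericity <- function(r, n) {
  p <- ncol(r)
  # Log-determinant computed directly for numerical stability.
  log_det <- as.numeric(determinant(r, logarithm = TRUE)$modulus)
  chisq <- -(n - 1 - (2 * p + 5) / 6) * log_det
  df <- p * (p - 1) / 2
  list(chisq = chisq,
       p.value = pchisq(chisq, df, lower.tail = FALSE),
       df = df)
}

# Stand-in data: mtcars replaces hsq, which is not available here.
res <- bartlett_sphericity(cor(mtcars), n = nrow(mtcars))
res
```

With the 32 HSQ items the degrees of freedom are 32 * 31 / 2 = 496, which matches the output above.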
  • Kaiser-Meyer-Olkin (KMO) measure of sampling adequacy

The closer the value is to 1, the more effective and reliable the reduction will be. The factorability tests suggest that I can proceed in reducing hsq dimensionality.

## Kaiser-Meyer-Olkin factor adequacy
## Call: KMO(r = hsq_polychoric)
## Overall MSA =  0.87
## MSA for each item = 
##   Q1   Q2   Q3   Q4   Q5   Q6   Q7   Q8   Q9  Q10  Q11  Q12  Q13  Q14  Q15 
## 0.94 0.93 0.91 0.90 0.91 0.88 0.82 0.86 0.95 0.86 0.78 0.90 0.85 0.93 0.82 
##  Q16  Q17  Q18  Q19  Q20  Q21  Q22  Q23  Q24  Q25  Q26  Q27  Q28  Q29  Q30 
## 0.85 0.87 0.83 0.89 0.83 0.87 0.84 0.81 0.84 0.83 0.89 0.83 0.93 0.87 0.81 
##  Q31  Q32 
## 0.81 0.91
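
The KMO measure compares the size of the ordinary correlations with the size of the partial correlations: when the partial correlations are small relative to the ordinary ones, the value approaches 1. The sketch below is a base R illustration of that ratio; kmo_msa() is a hypothetical helper, and the built-in Harman74.cor correlation matrix is used as a stand-in for hsq_polychoric.

```r
# Hypothetical helper: overall and per-item measure of sampling adequacy.
kmo_msa <- function(r) {
  r_inv <- solve(r)

  # Rescale the inverse so that its off-diagonal entries are the
  # partial correlations (up to sign, which vanishes when squared).
  scaling <- diag(1 / sqrt(diag(r_inv)))
  partial <- scaling %*% r_inv %*% scaling

  # Only off-diagonal entries enter the ratio.
  diag(partial) <- 0
  diag(r) <- 0

  list(
    overall = sum(r^2) / (sum(r^2) + sum(partial^2)),
    items = rowSums(r^2) / (rowSums(r^2) + rowSums(partial^2))
  )
}

# Stand-in data: correlations of 24 psychological tests from datasets.
msa <- kmo_msa(Harman74.cor$cov)
round(msa$overall, 2)
round(head(msa$items), 2)
```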
  • PAF on hsq_polychoric

Let’s look at another popular extraction method, Principal Axis Factoring (PAF). PAF’s main idea is that communality has a central role in extracting factors, since it can be interpreted as a measure of an item’s relation to all other items. An iterative approach is adopted. Initially, an estimate of the common variance is given in which the communalities are less than 1. After replacing the main diagonal of the correlation matrix (which usually consists of ones) with these estimates of the communalities, the new correlation matrix is updated and further replacements are repeated based on the new communalities until a number of iterations is reached or the communalities converge to a point that there is too little difference between two consecutive communalities.

hsq_correl_pa <- fa(hsq_polychoric, nfactors=4, fm="pa")
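
The iteration described above can be sketched in a few lines of base R. This is a simplified illustration of principal axis factoring and not the code that fa() runs: principal_axis() is a hypothetical helper, the squared multiple correlations used as starting communalities and the convergence tolerance are assumptions, and Harman74.cor stands in for hsq_polychoric.

```r
# Hypothetical helper: principal axis factoring on a correlation matrix.
principal_axis <- function(r, nfactors, max_iter = 100, tol = 1e-4) {
  # Initial communalities: squared multiple correlations (all below 1).
  communality <- 1 - 1 / diag(solve(r))

  for (iter in seq_len(max_iter)) {
    # Replace the ones on the diagonal with the current communalities.
    reduced <- r
    diag(reduced) <- communality

    # Loadings from the leading eigenvectors of the reduced matrix.
    e <- eigen(reduced, symmetric = TRUE)
    k <- seq_len(nfactors)
    loadings <- e$vectors[, k, drop = FALSE] %*%
      diag(sqrt(pmax(e$values[k], 0)), nrow = nfactors)

    # New communalities are the row sums of squared loadings.
    new_communality <- rowSums(loadings^2)
    converged <- max(abs(new_communality - communality)) < tol
    communality <- new_communality
    if (converged) break
  }

  names(communality) <- rownames(r)
  list(loadings = loadings,
       communality = communality,
       uniqueness = 1 - communality,
       iterations = iter)
}

# Stand-in data: correlations of 24 psychological tests from datasets.
fit <- principal_axis(Harman74.cor$cov, nfactors = 4)
head(sort(fit$communality, decreasing = TRUE))
```

Communality and uniqueness always sum to 1 for each item, which is why the two sorted listings that follow mirror each other.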
  • Sort the communalities of hsq_correl_pa

Identify variables that load well on the chosen factors

f_hsq_pa_common <- sort(hsq_correl_pa$communality, decreasing = TRUE)
f_hsq_pa_common
##       Q20       Q17       Q25       Q18        Q8       Q10       Q21 
## 0.6126774 0.5915220 0.5837439 0.5583484 0.5575454 0.5499162 0.5374533 
##       Q26       Q14       Q32       Q13       Q31       Q15        Q1 
## 0.5246559 0.5189606 0.5184342 0.5011976 0.5003349 0.4972556 0.4790009 
##       Q12        Q5        Q2        Q4       Q29        Q7        Q6 
## 0.4570093 0.4327538 0.4079485 0.4069703 0.3949128 0.3650824 0.3649526 
##        Q3       Q19       Q16       Q11       Q27       Q24       Q23 
## 0.3634246 0.3472616 0.3225140 0.3018913 0.2992410 0.2866446 0.2719056 
##       Q30        Q9       Q28       Q22 
## 0.2709174 0.2671010 0.2415143 0.1277128
  • Sort the uniquenesses of hsq_correl_pa
f_hsq_pa_unique <- sort(hsq_correl_pa$uniqueness, decreasing = TRUE)
f_hsq_pa_unique
##       Q22       Q28        Q9       Q30       Q23       Q24       Q27 
## 0.8722872 0.7584857 0.7328990 0.7290826 0.7280944 0.7133554 0.7007590 
##       Q11       Q16       Q19        Q3        Q6        Q7       Q29 
## 0.6981087 0.6774860 0.6527384 0.6365754 0.6350474 0.6349176 0.6050872 
##        Q4        Q2        Q5       Q12        Q1       Q15       Q31 
## 0.5930297 0.5920515 0.5672462 0.5429907 0.5209991 0.5027444 0.4996651 
##       Q13       Q32       Q14       Q26       Q21       Q10        Q8 
## 0.4988024 0.4815658 0.4810394 0.4753441 0.4625467 0.4500838 0.4424546 
##       Q18       Q25       Q17       Q20 
## 0.4416516 0.4162561 0.4084780 0.3873226
  • Scree test and the Kaiser-Guttman criterion

## Parallel analysis suggests that the number of factors =  7  and the number of components =  NA
  • Parallel analysis for estimation with the minres extraction method

The charts show the eigenvalues for both principal components and principal axis factor analysis.

## Parallel analysis suggests that the number of factors =  7  and the number of components =  5
  • Parallel analysis for estimation with the mle extraction method

## Parallel analysis suggests that the number of factors =  7  and the number of components =  5

Based on the three tests conducted, 4 factors should be retained.

Advanced EFA

This section will cover advanced applications of EFA.

## [1] "oblimin"
## [1] "promax"
## [1] "varimax"

The Varimax rotation method is the most suitable for arriving at the most interpretable EFA model on the HSQ dataset. The decision on the rotation method is based on the clarity of the path diagram and the interpretability of the arrow connections.
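
To compare rotations without extra packages, base R offers varimax() (orthogonal) and promax() (oblique); oblimin is not part of base R and is left out of this sketch. The example is only an illustration under stated assumptions: Harman74.cor stands in for the HSQ correlations, and factanal() (maximum likelihood) stands in for the extraction step.

```r
# Stand-in data and extraction: 4 unrotated ML factors from Harman74.cor.
unrotated <- factanal(covmat = Harman74.cor, factors = 4, rotation = "none")
raw <- loadings(unrotated)

orthogonal <- varimax(raw)  # factors stay uncorrelated
oblique <- promax(raw)      # factors are allowed to correlate

# An orthogonal rotation redistributes variance across factors but
# leaves each item's communality (row sum of squared loadings) unchanged.
communality <- function(l) rowSums(unclass(l)^2)
all.equal(communality(raw), communality(orthogonal$loadings),
          check.attributes = FALSE)

# Rotated loadings, hiding small values to make the structure visible.
print(orthogonal$loadings, cutoff = 0.3)
```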

The loadings matrix is accessible through the loadings attribute.

## 
## Loadings:
##     MR1    MR2    MR4    MR3   
## Q1   0.675 -0.055 -0.005  0.029
## Q2  -0.085 -0.023  0.604 -0.034
## Q3  -0.113  0.130  0.082 -0.512
## Q4   0.018  0.635  0.023  0.002
## Q5  -0.599  0.066  0.095 -0.015
## Q6  -0.257 -0.038  0.462 -0.040
## Q7  -0.166 -0.030  0.002  0.607
## Q8  -0.027  0.741  0.009  0.007
## Q9   0.485 -0.124  0.011  0.016
## Q10  0.034  0.056  0.736 -0.020
## Q11  0.139 -0.018  0.163 -0.541
## Q12 -0.142  0.663 -0.045  0.054
## Q13 -0.641  0.006  0.143  0.005
## Q14 -0.173 -0.018  0.644  0.022
## Q15  0.055 -0.044  0.078  0.688
## Q16  0.123 -0.523  0.118  0.126
## Q17  0.769 -0.035  0.043  0.050
## Q18  0.107  0.005  0.780 -0.004
## Q19 -0.122  0.141  0.229 -0.412
## Q20  0.099  0.779 -0.021 -0.082
## Q21 -0.641  0.131  0.153  0.156
## Q22  0.136  0.059 -0.267  0.105
## Q23  0.124  0.036  0.004  0.491
## Q24  0.218  0.509  0.083  0.042
## Q25  0.761  0.061 -0.020  0.007
## Q26 -0.078  0.033  0.685  0.032
## Q27  0.103  0.058  0.114 -0.530
## Q28 -0.075  0.272  0.248 -0.155
## Q29  0.607  0.182  0.068  0.151
## Q30  0.005 -0.133  0.538 -0.008
## Q31  0.107  0.066  0.080  0.693
## Q32 -0.074  0.694  0.051  0.023
## 
##                  MR1   MR2   MR4   MR3
## SS loadings    3.768 3.241 3.233 2.688
## Proportion Var 0.118 0.101 0.101 0.084
## Cumulative Var 0.118 0.219 0.320 0.404

HSQ measures two positive features for styles of humor:

  1. affiliative: ‘Q1’, ‘Q5’, ‘Q9’, ‘Q13’, ‘Q17’, ‘Q21’, ‘Q25’, ‘Q29’
  2. self-enhancing: ‘Q2’, ‘Q6’, ‘Q10’, ‘Q14’, ‘Q18’, ‘Q22’, ‘Q26’, ‘Q30’

HSQ measures two negative features for styles of humor:

  1. aggressive: ‘Q3’, ‘Q7’, ‘Q11’, ‘Q15’, ‘Q19’, ‘Q23’, ‘Q27’, ‘Q31’
  2. self-defeating: ‘Q4’, ‘Q8’, ‘Q12’, ‘Q16’, ‘Q20’, ‘Q24’, ‘Q28’, ‘Q32’

The extracted factor MR1 could measure the affiliative style. This factor maps to most or all of the questions that correspond to the affiliative style. The classification of the questionnaire items is listed above.
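
A simple way to read the loadings table is to assign each item to the factor on which it has the largest absolute loading. The sketch below does this for the first four rows of the table, copied by hand; the variable names are illustrative.

```r
# First four rows of the loadings table above, copied by hand.
l <- matrix(
  c( 0.675, -0.055, -0.005,  0.029,
    -0.085, -0.023,  0.604, -0.034,
    -0.113,  0.130,  0.082, -0.512,
     0.018,  0.635,  0.023,  0.002),
  nrow = 4, byrow = TRUE,
  dimnames = list(c("Q1", "Q2", "Q3", "Q4"), c("MR1", "MR2", "MR4", "MR3"))
)

# Assign each item to the factor with the largest absolute loading.
dominant <- colnames(l)[max.col(abs(l), ties.method = "first")]
names(dominant) <- rownames(l)
dominant
```

Q1 (affiliative) lands on MR1, Q2 (self-enhancing) on MR4, Q3 (aggressive) on MR3 and Q4 (self-defeating) on MR2, which agrees with the classification listed above.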

Data

The Short Dark Triad (SD3) dataset results from measuring the 3 dark personality traits:

  • machiavellianism (a manipulative behaviour)
  • narcissism (excessive self-admiration)
  • psychopathy (lack of empathy)

Interactive version of the test: https://openpsychometrics.org/tests/SD3/

EFA: The steps

  • Check for data factorability
  • Choose the “right” number of factors to retain
  • Extract factors
  • Rotate factors
  • Interpret the results

Data factorability

  • Explore sdt_sub_correl

The sdt_sub_correl object was calculated with the hetcor() function of the polycor package.

## List of 7
##  $ correlations: num [1:27, 1:27] 1 0.184 0.102 0.217 0.369 ...
##   ..- attr(*, "dimnames")=List of 2
##   .. ..$ : chr [1:27] "M1" "M2" "M3" "M4" ...
##   .. ..$ : chr [1:27] "M1" "M2" "M3" "M4" ...
##  $ type        : chr [1:27, 1:27] "" "Pearson" "Pearson" "Pearson" ...
##  $ NA.method   : chr "complete.obs"
##  $ ML          : logi FALSE
##  $ std.errors  : num [1:27, 1:27] 0 0.0969 0.0993 0.0956 0.0868 ...
##   ..- attr(*, "dimnames")=List of 2
##   .. ..$ : chr [1:27] "M1" "M2" "M3" "M4" ...
##   .. ..$ : chr [1:27] "M1" "M2" "M3" "M4" ...
##  $ n           : int 100
##  $ tests       : num [1:27, 1:27] 0.00 5.78e-13 1.55e-16 8.63e-14 4.36e-14 ...
##   ..- attr(*, "dimnames")=List of 2
##   .. ..$ : chr [1:27] "M1" "M2" "M3" "M4" ...
##   .. ..$ : chr [1:27] "M1" "M2" "M3" "M4" ...
##  - attr(*, "class")= chr "hetcor"

Correlation matrix of the sdt_sub_correl

## $chisq
## [1] 1019.442
## 
## $p.value
## [1] 2.054927e-66
## 
## $df
## [1] 351
## Kaiser-Meyer-Olkin factor adequacy
## Call: KMO(r = sdt_polychoric)
## Overall MSA =  0.82
## MSA for each item = 
##   M1   M2   M3   M4   M5   M6   M7   M8   M9   N1   N2   N3   N4   N5   N6 
## 0.78 0.84 0.80 0.66 0.91 0.84 0.68 0.77 0.79 0.80 0.82 0.83 0.87 0.85 0.84 
##   N7   N8   N9   P1   P2   P3   P4   P5   P6   P7   P8   P9 
## 0.80 0.81 0.89 0.89 0.64 0.87 0.52 0.81 0.88 0.52 0.63 0.85

Choose the “right” number of factors to retain

The number of factors recommended is 6.

  • Parallel analysis for estimation with the minres extraction method

Conduct parallel analysis for estimation with the minres extraction method and check the Kaiser-Guttman criterion.

## Parallel analysis suggests that the number of factors =  4  and the number of components =  NA

The Kaiser-Guttman criterion and the scree test suggest 3 and 4 factors.

Extract factors

  • EFA with MLE extraction method

A total of 4 factors are extracted with the maximum likelihood estimation (MLE) extraction method.

Rotate factors

  • Factor loadings
## 
## Loadings:
##    ML1    ML4    ML2    ML3   
## M1  0.005  0.043  0.578 -0.194
## M2  0.236  0.407  0.193  0.152
## M3 -0.019  0.654  0.023  0.091
## M4  0.029  0.329  0.254 -0.134
## M5  0.184  0.179  0.550  0.075
## M6  0.064 -0.099  0.849  0.055
## M7  0.104  0.171  0.438 -0.454
## M8  0.504  0.255 -0.025 -0.183
## M9  0.048  0.325  0.450  0.037
## N1  0.082  0.202  0.033  0.409
## N2  0.037 -0.160 -0.105 -0.501
## N3  0.221  0.056  0.012  0.615
## N4 -0.014  0.438  0.160  0.372
## N5 -0.059  0.580  0.107  0.166
## N6 -0.299 -0.300  0.104 -0.356
## N7 -0.189  0.346  0.222  0.219
## N8 -0.197 -0.058 -0.276 -0.334
## N9  0.754 -0.003  0.014 -0.017
## P1  0.411  0.012  0.296  0.053
## P2  0.001 -0.129 -0.089 -0.213
## P3  0.395 -0.008  0.220  0.020
## P4  0.015  0.104 -0.111  0.318
## P5  0.556  0.026  0.076  0.070
## P6  0.634 -0.047  0.174  0.139
## P7 -0.419  0.131  0.190 -0.016
## P8  0.101  0.594 -0.179 -0.277
## P9  0.261  0.525 -0.049  0.084
## 
##                  ML1   ML4   ML2   ML3
## SS loadings    2.445 2.432 2.304 1.844
## Proportion Var 0.091 0.090 0.085 0.068
## Cumulative Var 0.091 0.181 0.266 0.334
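
The summary rows under the loadings follow from simple arithmetic: "SS loadings" is the column sum of squared loadings, "Proportion Var" divides it by the number of items (27 here), and "Cumulative Var" is the running total. The sketch below checks this with the SS loadings copied by hand from the table above.

```r
# SS loadings copied from the table above; 27 items in the SD3 subset.
ss_loadings <- c(ML1 = 2.445, ML4 = 2.432, ML2 = 2.304, ML3 = 1.844)
n_items <- 27

# Share of the total standardized variance explained by each factor.
proportion_var <- ss_loadings / n_items
cumulative_var <- cumsum(proportion_var)

round(rbind(ss_loadings, proportion_var, cumulative_var), 3)
```

The four factors together explain about a third of the variance in the 27 items.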
  • Path diagram of the latent factors

The path diagram helps with drawing conclusions about the underlying factors in the dataset.

Interpret the results

The twenty-seven statements of the Short Dark Triad test correspond well to the three personality traits:

  • machiavellianism (a manipulative attitude)
  • narcissism (excessive self-love)
  • psychopathy (lack of empathy)